.. _`Text Count Vectorizer`:
.. _`org.sysess.sympathy.machinelearning.count_vectorizer`:

Text Count Vectorizer
`````````````````````

.. image:: count_vectorizer.svg
   :width: 48

Convert a collection of text documents to a matrix of token counts

Documentation
:::::::::::::

Attributes
==========

**stop_words_**
    Terms that were ignored because they either:

    - occurred in too many documents (`max_df`)
    - occurred in too few documents (`min_df`)
    - were cut off by feature selection (`max_features`).

    This is only available if no vocabulary was given.

**vocabulary_**
    A mapping of terms to feature indices.

Definition
::::::::::

Output ports
============

**model** model
    Model

Configuration
=============

**Analyzer** (analyzer)
    Whether the features should be made of word n-grams or character
    n-grams. Option 'char_wb' creates character n-grams only from text
    inside word boundaries; n-grams at the edges of words are padded with
    space.

    If a callable is passed, it is used to extract the sequence of features
    out of the raw, unprocessed input.

    .. versionchanged:: 0.21
        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
        first read from the file and then passed to the given callable
        analyzer.

**Binary** (binary)
    If True, all non-zero counts are set to 1. This is useful for discrete
    probabilistic models that model binary events rather than integer
    counts.

**Decoding error behavior** (decode_error)
    Instruction on what to do if a byte sequence is given to analyze that
    contains characters not of the given `encoding`. By default, it is
    'strict', meaning that a UnicodeDecodeError will be raised. Other
    values are 'ignore' and 'replace'.

**Encoding** (encoding)
    If bytes or files are given to analyze, this encoding is used to
    decode.

**Lowercase** (lowercase)
    Convert all characters to lowercase before tokenizing.
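The effect of the ``analyzer``, ``binary`` and ``lowercase`` options can be illustrated with a small pure-Python sketch. It is not the implementation used by this node: the helper names ``char_wb_ngrams`` and ``count_tokens`` are hypothetical, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter


def char_wb_ngrams(text, n):
    """Character n-grams taken inside word boundaries (hypothetical helper).

    Each word is padded with one space on both sides, so n-grams at the
    edges of a word contain the padding space.
    """
    grams = []
    for word in text.split():
        padded = " " + word + " "
        if len(padded) <= n:
            # Word too short for a full n-gram: emit the padded word once.
            grams.append(padded)
            continue
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def count_tokens(text, lowercase=True, binary=False):
    """Count whitespace-separated tokens in one document (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    counts = Counter(text.split())
    if binary:
        # With binary=True every non-zero count collapses to 1.
        counts = Counter({token: 1 for token in counts})
    return counts


if __name__ == "__main__":
    print(char_wb_ngrams("hi you", 3))
    print(count_tokens("The cat the CAT"))
    print(count_tokens("The cat the CAT", binary=True))
```

With ``lowercase`` disabled, ``The`` and ``the`` are counted as different tokens, which is why lowercasing is applied before tokenizing.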
**Maximum document frequency** (max_df)
    When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold (corpus-specific
    stop words).
    If float, the parameter represents a proportion of documents; if
    integer, absolute counts.
    This parameter is ignored if vocabulary is not None.

**Maximum features** (max_features)
    If not None, build a vocabulary that considers only the top
    `max_features` terms ordered by term frequency across the corpus.
    Otherwise, all features are used.

    This parameter is ignored if vocabulary is not None.

**Minimum document frequency** (min_df)
    When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold. This value is also
    called cut-off in the literature.
    If float, the parameter represents a proportion of documents; if
    integer, absolute counts.
    This parameter is ignored if vocabulary is not None.

**N-gram range** (ngram_range)
    The lower and upper boundary of the range of n-values for different
    word n-grams or char n-grams to be extracted. All values of n such
    that min_n <= n <= max_n will be used. For example, an ``ngram_range``
    of ``(1, 1)`` means only unigrams, ``(1, 2)`` means unigrams and
    bigrams, and ``(2, 2)`` means only bigrams.
    Only applies if ``analyzer`` is not callable.

**Stop words** (stop_words)
    If 'english', a built-in stop word list for English is used.
    There are several known issues with 'english' and you should
    consider an alternative (see stop_words).

    If a list, that list is assumed to contain stop words, all of which
    will be removed from the resulting tokens.
    Only applies if ``analyzer == 'word'``.

    If None, no stop words will be used. In this case, setting `max_df`
    to a higher value, such as in the range (0.7, 1.0), can automatically
    detect and filter stop words based on intra corpus document frequency
    of terms.

**Strip accents** (strip_accents)
    Remove accents and perform other character normalization
    during the preprocessing step.
    'ascii' is a fast method that only works on characters that have
    a direct ASCII mapping.
    'unicode' is a slightly slower method that works on any characters.
    None (default) means no character normalization is performed.

    Both 'ascii' and 'unicode' use NFKD normalization from
    :func:`unicodedata.normalize`.

Implementation
==============

.. automodule:: node_text
    :noindex:

.. class:: CountVectorizer
    :noindex:
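As an illustration of the ``min_df``, ``max_df``, ``ngram_range`` and ``strip_accents`` options described under Configuration, the following pure-Python sketch builds a term-to-index mapping comparable to ``vocabulary_``. It is a simplified model under stated assumptions, not the code behind this node: the function names are hypothetical, tokens are split on whitespace, and assigning indices in sorted term order is an assumption of the sketch.

```python
import unicodedata
from collections import Counter


def strip_accents_unicode(text):
    """Decompose with NFKD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_ngrams(tokens, ngram_range):
    """All word n-grams with min_n <= n <= max_n, joined by a space."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(docs, ngram_range=(1, 1), min_df=1, max_df=1.0):
    """Map terms to feature indices after document-frequency filtering."""
    tokenized = [
        word_ngrams(strip_accents_unicode(doc).lower().split(), ngram_range)
        for doc in docs
    ]
    n_docs = len(docs)
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Drop terms strictly below min_df or strictly above max_df.
    kept = sorted(t for t, df in doc_freq.items() if low <= df <= high)
    return {term: index for index, term in enumerate(kept)}


if __name__ == "__main__":
    docs = ["the café is open", "the cafe is closed", "the bar"]
    print(build_vocabulary(docs))
    print(build_vocabulary(docs, min_df=2, max_df=0.7))
```

In this example "the" occurs in all three documents, so a ``max_df`` of 0.7 removes it as a corpus-specific stop word, while ``min_df=2`` removes terms that occur in only one document.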